Dataset

The dataset is the USDA National Nutrient Database. It contains 8790 foods and 51 nutrition measures for each food. All values are measured per 100 g of food.

library(readxl)
library(knitr)
a <- read_excel("~/Downloads/sr28abxl/ABBREV.xlsx")
kable(head(a,10))
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) Calcium_(mg) Iron_(mg) Magnesium_(mg) Phosphorus_(mg) Potassium_(mg) Sodium_(mg) Zinc_(mg) Copper_mg) Manganese_(mg) Selenium_(µg) Vit_C_(mg) Thiamin_(mg) Riboflavin_(mg) Niacin_(mg) Panto_Acid_mg) Vit_B6_(mg) Folate_Tot_(µg) Folic_Acid_(µg) Food_Folate_(µg) Folate_DFE_(µg) Choline_Tot_ (mg) Vit_B12_(µg) Vit_A_IU Vit_A_RAE Retinol_(µg) Alpha_Carot_(µg) Beta_Carot_(µg) Beta_Crypt_(µg) Lycopene_(µg) Lut+Zea_ (µg) Vit_E_(mg) Vit_D_µg Vit_D_IU Vit_K_(µg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg) GmWt_1 GmWt_Desc1 GmWt_2 GmWt_Desc2 Refuse_Pct
01001 BUTTER,WITH SALT 15.87 717 0.85 81.11 2.11 0.06 0 0.06 24 0.02 2 24 24 643 0.09 0.000 0.000 1.0 0 0.005 0.034 0.042 0.110 0.003 3 0 3 3 18.8 0.17 2499 684 671 0 158 0 0 0 2.32 0.0 0 7.0 51.368 21.021 3.043 215 5.00 1 pat, (1" sq, 1/3" high) 14.2 1 tbsp 0
01002 BUTTER,WHIPPED,W/ SALT 16.72 718 0.49 78.30 1.62 2.87 0 0.06 23 0.05 1 24 41 583 0.05 0.010 0.001 0.0 0 0.007 0.064 0.022 0.097 0.008 4 0 4 4 18.8 0.07 2468 683 671 1 135 6 0 13 1.37 0.0 0 4.6 45.390 19.874 3.331 225 3.80 1 pat, (1" sq, 1/3" high) 9.4 1 tbsp 0
01003 BUTTER OIL,ANHYDROUS 0.24 876 0.28 99.48 0.00 0.00 0 0.00 4 0.00 0 3 5 2 0.01 0.001 0.000 0.0 0 0.001 0.005 0.003 0.010 0.001 0 0 0 0 22.3 0.01 3069 840 824 0 193 0 0 0 2.80 0.0 0 8.6 61.924 28.732 3.694 256 12.80 1 tbsp 205.0 1 cup 0
01004 CHEESE,BLUE 42.41 353 21.40 28.74 5.11 2.34 0 0.50 528 0.31 23 387 256 1146 2.66 0.040 0.009 14.5 0 0.029 0.382 1.016 1.729 0.166 36 0 36 36 15.4 1.22 721 198 192 0 74 0 0 0 0.25 0.5 21 2.4 18.669 7.778 0.800 75 28.35 1 oz 17.0 1 cubic inch 0
01005 CHEESE,BRICK 41.11 371 23.24 29.68 3.18 2.79 0 0.51 674 0.43 24 451 136 560 2.60 0.024 0.012 14.5 0 0.014 0.351 0.118 0.288 0.065 20 0 20 20 15.4 1.26 1080 292 286 0 76 0 0 0 0.26 0.5 22 2.5 18.764 8.598 0.784 94 132.00 1 cup, diced 113.0 1 cup, shredded 0
01006 CHEESE,BRIE 48.42 334 20.75 27.68 2.70 0.45 0 0.45 184 0.50 20 188 152 629 2.38 0.019 0.034 14.5 0 0.070 0.520 0.380 0.690 0.235 65 0 65 65 15.4 1.65 592 174 173 0 9 0 0 0 0.24 0.5 20 2.3 17.410 8.013 0.826 100 28.35 1 oz 144.0 1 cup, sliced 0
01007 CHEESE,CAMEMBERT 51.80 300 19.80 24.26 3.68 0.46 0 0.46 388 0.33 20 347 187 842 2.38 0.021 0.038 14.5 0 0.028 0.488 0.630 1.364 0.227 62 0 62 62 15.4 1.30 820 241 240 0 12 0 0 0 0.21 0.4 18 2.0 15.259 7.023 0.724 72 28.35 1 oz 246.0 1 cup 0
01008 CHEESE,CARAWAY 39.28 376 25.18 29.20 3.28 3.06 0 NA 673 0.64 22 490 93 690 2.94 0.024 0.021 14.5 0 0.031 0.450 0.180 0.190 0.074 18 0 18 18 NA 0.27 1054 271 262 NA NA NA NA NA NA NA NA NA 18.584 8.275 0.830 93 28.35 1 oz NA NA 0
01009 CHEESE,CHEDDAR 37.02 404 22.87 33.31 3.71 3.09 0 0.48 710 0.14 27 455 76 653 3.64 0.030 0.027 28.5 0 0.029 0.428 0.059 0.410 0.066 27 0 27 27 16.5 1.10 1242 330 330 0 85 0 0 0 0.71 0.6 24 2.4 18.867 9.246 1.421 99 132.00 1 cup, diced 244.0 1 cup, melted 0
01010 CHEESE,CHESHIRE 37.65 387 23.37 30.60 3.60 4.78 0 NA 643 0.21 21 464 95 700 2.79 0.042 0.012 14.5 0 0.046 0.293 0.080 0.413 0.074 18 0 18 18 NA 0.83 985 233 220 NA NA NA NA NA NA NA NA NA 19.475 8.671 0.870 103 28.35 1 oz NA NA 0

Our goal is to give people some insight into choosing healthy foods. We focus on the nutrition facts that people (especially those who work out or want to lose weight) are most familiar with and care about most: Energy (Kcal), Fat, Protein, Sugar, Carbohydrates, Fiber, Cholesterol, Sodium, Potassium, Vitamin A, Vitamin C, Calcium, and Iron.

We clean the data with “dplyr”. We split the food name on commas to extract the category, and also create an abbreviated name for later use. Then we select the 13 nutrition facts we want to focus on.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Split each description on commas: the first piece is the category, the second the short name
parts <- strsplit(as.character(a$Shrt_Desc), ",")
a$category <- factor(sapply(parts, `[`, 1))
# Keep the first 10 characters of the short name (NA when the description has no comma)
a$Abb <- substr(sapply(parts, `[`, 2), 1, 10)
c<- a%>% 
  select(Food=Shrt_Desc, Abb,category,
         Energ_Kcal, `Protein_(g)`,`Carbohydrt_(g)`, 
         `Fiber_TD_(g)`,`Sugar_Tot_(g)`, `Lipid_Tot_(g)`,
         `Cholestrl_(mg)`,`Sodium_(mg)`,`Potassium_(mg)`,
         Vit_A_IU,`Vit_C_(mg)`,`Calcium_(mg)`,`Iron_(mg)`)
kable(head(c,10))
Food Abb category Energ_Kcal Protein_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) Lipid_Tot_(g) Cholestrl_(mg) Sodium_(mg) Potassium_(mg) Vit_A_IU Vit_C_(mg) Calcium_(mg) Iron_(mg)
BUTTER,WITH SALT WITH SALT BUTTER 717 0.85 0.06 0 0.06 81.11 215 643 24 2499 0 24 0.02
BUTTER,WHIPPED,W/ SALT WHIPPED BUTTER 718 0.49 2.87 0 0.06 78.30 225 583 41 2468 0 23 0.05
BUTTER OIL,ANHYDROUS ANHYDROUS BUTTER OIL 876 0.28 0.00 0 0.00 99.48 256 2 5 3069 0 4 0.00
CHEESE,BLUE BLUE CHEESE 353 21.40 2.34 0 0.50 28.74 75 1146 256 721 0 528 0.31
CHEESE,BRICK BRICK CHEESE 371 23.24 2.79 0 0.51 29.68 94 560 136 1080 0 674 0.43
CHEESE,BRIE BRIE CHEESE 334 20.75 0.45 0 0.45 27.68 100 629 152 592 0 184 0.50
CHEESE,CAMEMBERT CAMEMBERT CHEESE 300 19.80 0.46 0 0.46 24.26 72 842 187 820 0 388 0.33
CHEESE,CARAWAY CARAWAY CHEESE 376 25.18 3.06 0 NA 29.20 93 690 93 1054 0 673 0.64
CHEESE,CHEDDAR CHEDDAR CHEESE 404 22.87 3.09 0 0.48 33.31 99 653 76 1242 0 710 0.14
CHEESE,CHESHIRE CHESHIRE CHEESE 387 23.37 4.78 0 NA 30.60 103 700 95 985 0 643 0.21
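
To illustrate the comma split in isolation, here is a minimal sketch on three descriptions copied from the table above plus one made-up entry without a comma. The object names (`desc`, `parts`, `abb`) and the entry "PLAIN ITEM" are hypothetical and only serve the illustration.

```r
# Three descriptions from the table above, plus one hypothetical entry with no comma
desc <- c("BUTTER,WITH SALT", "BUTTER,WHIPPED,W/ SALT", "CHEESE,BLUE", "PLAIN ITEM")

# strsplit returns one character vector of pieces per description
parts <- strsplit(desc, ",")

# The first piece is the category; the second piece (if any) is the short name
category <- sapply(parts, `[`, 1)
abb <- substr(sapply(parts, `[`, 2), 1, 10)  # NA when there is no second piece

print(data.frame(desc, category, abb))
```

Taking the pieces by position avoids binding vectors of different lengths into one matrix, so an entry without a comma simply gets a missing short name.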

Energy

We look at energy first, because it is the nutrition fact people care about most. We compute summary statistics and draw a base R index plot of the energy values. The data range from 0 to 902 kilocalories, with a median of 191 kilocalories. The mean (226.3) is higher than the median, which suggests the distribution is skewed to the right: a relatively small number of foods with very high energy pull the mean up.

summary(c$Energ_Kcal)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    91.0   191.0   226.3   337.0   902.0
plot(c$Energ_Kcal, ylab="Energy(Kcal)")

We also draw a histogram of the energy distribution. It is indeed skewed to the right, and at least three quarters of the foods have less than 400 kilocalories (the third quartile is 337).

hist(c$Energ_Kcal, xlab="Energy in kilocalories")
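
The mean-versus-median argument can be checked on a small scale. The sketch below uses only the ten energy values shown in the table at the top of this report (three butters and seven cheeses), so it is an illustration of the reasoning, not a summary of the full dataset.

```r
# Energy (Kcal) of the ten foods printed in the first table
energy <- c(717, 718, 876, 353, 371, 334, 300, 376, 404, 387)

# A few high-energy items (the butters) pull the mean above the median
m  <- mean(energy)    # 483.6
md <- median(energy)  # 381.5
print(c(mean = m, median = md))
```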

Correlation

After this quick look at energy, we want to see how it correlates with the other nutrition facts. We use the package “corrplot” to draw a correlation plot of the 13 nutrition facts. The bigger the circle and the darker the color, the stronger the correlation between the two variables. From this plot, we find that energy is most strongly related to fat, followed by carbohydrate, sugar, fiber and protein.

library(corrplot)
## corrplot 0.84 loaded
C<-cor(na.omit(c[4:16]))
corrplot(C, method="circle")
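
Note that `na.omit` drops every row that has a missing value in any of the selected columns. As a hedged sketch of the alternative, base R's `cor` can instead use all available pairs per variable combination via `use = "pairwise.complete.obs"`. The toy data frame below copies energy, fat and sugar for the ten foods in the first table; it is not the approach used in this report.

```r
# Energy, fat and sugar for the ten foods in the first table (sugar is NA for two cheeses)
toy <- data.frame(
  energy = c(717, 718, 876, 353, 371, 334, 300, 376, 404, 387),
  fat    = c(81.11, 78.30, 99.48, 28.74, 29.68, 27.68, 24.26, 29.20, 33.31, 30.60),
  sugar  = c(0.06, 0.06, 0.00, 0.50, 0.51, 0.45, 0.46, NA, 0.48, NA)
)

# Listwise deletion: only rows with no NA at all are kept
n_complete <- nrow(na.omit(toy))
cor_complete <- cor(na.omit(toy))

# Pairwise deletion: energy vs fat still uses all ten rows
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

print(n_complete)
print(round(cor_pairwise, 3))
```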

Energy and Fat

First, we fit a linear model of energy on fat. In the summary, the slope is 8.67: each additional gram of fat per 100 g of food is associated with about 8.67 more kilocalories. The p-value is very small, so we conclude that the slope coefficient is statistically significant, and fat alone explains about 65% of the variance in energy (R-squared 0.6507).

lr1<-lm(c$Energ_Kcal~c$`Lipid_Tot_(g)`)
summary(lr1)
## 
## Call:
## lm(formula = c$Energ_Kcal ~ c$`Lipid_Tot_(g)`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -238.37  -71.17  -31.00   43.58  264.13 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       134.86911    1.28748   104.8   <2e-16 ***
## c$`Lipid_Tot_(g)`   8.66505    0.06772   128.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 100.4 on 8788 degrees of freedom
## Multiple R-squared:  0.6507, Adjusted R-squared:  0.6507 
## F-statistic: 1.637e+04 on 1 and 8788 DF,  p-value: < 2.2e-16
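
As a small sketch of how such a fit can be used, the example below refits the same kind of model on the ten foods from the first table, using the formula interface with a `data` argument so that `predict` works on new foods. The column names `energy` and `fat` and the 50 g food are hypothetical, and the coefficients from ten rows will differ from the full-data fit above.

```r
# Energy and fat for the ten foods printed in the first table
toy <- data.frame(
  energy = c(717, 718, 876, 353, 371, 334, 300, 376, 404, 387),
  fat    = c(81.11, 78.30, 99.48, 28.74, 29.68, 27.68, 24.26, 29.20, 33.31, 30.60)
)

# Formula interface with data=, so variable names stay short in the output
fit <- lm(energy ~ fat, data = toy)
print(coef(fit))

# Predicted energy for a hypothetical food with 50 g of fat per 100 g
new_food <- data.frame(fat = 50)
pred <- predict(fit, newdata = new_food)
print(pred)
```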

We also use a package that was new to us, “hexbin”, to draw a hexbin plot of energy against fat with the regression line overlaid. The plot shows the distribution and density of the observations together with the linear fit. Most foods have less than 400 Kcal and less than 20 g of fat per 100 g, and the points follow the regression line fairly closely.

library(hexbin)
library(RColorBrewer)
x1<-hexbinplot(c$Energ_Kcal~c$`Lipid_Tot_(g)`,  type=c("r"), col.line = "red", lwd=3, xlab="Lipid(g)", ylab="Energy(Kcal)")
x1

Energy and Carbohydrate

Next, we run the same analysis for energy and carbohydrate. We again get a positive slope coefficient (3.07), so in general, as the amount of carbohydrate in a food goes up, the energy goes up as well. The p-value is again very small, so the result is statistically significant. However, the hexbin plot shows that the observations are much more dispersed around the line. Some foods have very little carbohydrate but a high amount of calories.

lr2<-lm(c$Energ_Kcal~c$`Carbohydrt_(g)`)
summary(lr2)
## 
## Call:
## lm(formula = c$Energ_Kcal ~ c$`Carbohydrt_(g)`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -465.48 -112.94  -23.06   57.64  743.64 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        158.35884    2.03022   78.00   <2e-16 ***
## c$`Carbohydrt_(g)`   3.07121    0.05781   53.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 147.8 on 8788 degrees of freedom
## Multiple R-squared:  0.2431, Adjusted R-squared:  0.243 
## F-statistic:  2822 on 1 and 8788 DF,  p-value: < 2.2e-16
x2<-hexbinplot(c$Energ_Kcal~c$`Carbohydrt_(g)`, type=c("r"), col.line = "red", lwd=3, xlab="Carbohydrate(g)", ylab="Energy(Kcal)")
x2

Energy and Sugar

This time, we regress energy on sugar, and again we get a positive slope (4.03). The p-value is again extremely small, so the slope coefficient is statistically significant. Note that 1832 foods have no sugar value and are dropped from this fit.

lr3<-lm(c$Energ_Kcal~c$`Sugar_Tot_(g)`)
summary(lr3)
## 
## Call:
## lm(formula = c$Energ_Kcal ~ c$`Sugar_Tot_(g)`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -222.94 -132.01  -18.73   82.66  706.27 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       195.7327     2.2137   88.42   <2e-16 ***
## c$`Sugar_Tot_(g)`   4.0290     0.1287   31.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 160.3 on 6956 degrees of freedom
##   (1832 observations deleted due to missingness)
## Multiple R-squared:  0.1234, Adjusted R-squared:  0.1233 
## F-statistic: 979.4 on 1 and 6956 DF,  p-value: < 2.2e-16
x3<-hexbinplot(c$Energ_Kcal~c$`Sugar_Tot_(g)`, type=c("r"), col.line = "red", lwd=3, xlab="Sugar(g)", ylab="Energy(Kcal)")
x3

Energy and Protein

Finally, we regress energy on protein. We still get a positive slope, but this time it is only 1.817, so the regression line is much flatter than in the three previous cases. The p-value of the coefficient is very small, which indicates the result is statistically significant, although protein alone explains only about 1% of the variance in energy (R-squared 0.0127).

lr4<-lm(c$Energ_Kcal~c$`Protein_(g)`)
summary(lr4)
## 
## Call:
## lm(formula = c$Energ_Kcal ~ c$`Protein_(g)`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -205.89 -129.65  -53.48  104.41  696.29 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      205.707      2.646   77.73   <2e-16 ***
## c$`Protein_(g)`    1.817      0.171   10.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 168.8 on 8788 degrees of freedom
## Multiple R-squared:  0.01269,    Adjusted R-squared:  0.01257 
## F-statistic: 112.9 on 1 and 8788 DF,  p-value: < 2.2e-16
x4<-hexbinplot(c$Energ_Kcal~c$`Protein_(g)`,type=c("r"), col.line = "red", lwd=3, xlab="Protein(g)", ylab="Energy(Kcal)")
x4

Summary

We use the package “gridExtra” to arrange the four graphs together and compare the four regressions. The slopes for fat (8.67), sugar (4.03) and carbohydrate (3.07) are steeper than the one for protein (1.82). In these single-predictor fits, an increase in protein is associated with only a small increase in energy, while increases in carbohydrate and especially fat are associated with many more calories. As a result, for people who want to lose weight, it is better to eat foods with less carbohydrate and fat.

library(gridExtra)    
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(x1,x2,x3,x4,nrow=2)
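
Each regression above looks at one nutrient at a time, so each slope also absorbs the effect of whatever else the food contains. As a rough cross-check, the sketch below combines the three macronutrients using the commonly quoted general energy factors of 4 Kcal per gram of protein, 4 per gram of carbohydrate and 9 per gram of fat. These factors are an outside assumption and are not part of the analysis in this report; the data are the ten foods from the first table.

```r
# Energy, protein, carbohydrate and fat for the ten foods in the first table
toy <- data.frame(
  energy  = c(717, 718, 876, 353, 371, 334, 300, 376, 404, 387),
  protein = c(0.85, 0.49, 0.28, 21.40, 23.24, 20.75, 19.80, 25.18, 22.87, 23.37),
  carb    = c(0.06, 2.87, 0.00, 2.34, 2.79, 0.45, 0.46, 3.06, 3.09, 4.78),
  fat     = c(81.11, 78.30, 99.48, 28.74, 29.68, 27.68, 24.26, 29.20, 33.31, 30.60)
)

# Assumed general factors (Kcal per gram): protein 4, carbohydrate 4, fat 9
toy$estimate <- 4 * toy$protein + 4 * toy$carb + 9 * toy$fat

# Relative gap between the estimate and the reported energy
toy$rel_error <- abs(toy$estimate - toy$energy) / toy$energy
print(round(toy, 3))
```

For these ten foods the combined estimate stays within a few percent of the reported energy, which is consistent with fat having the steepest single-predictor slope.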

Low Calories Food

We want to see what kinds of food have the lowest calories, so we use dplyr to group the dataset by category, compute the average calories of each category, and arrange the result in ascending order. Salt and sweetener have 0 calories on average, and tea and water are also very low. We would expect water to have 0 calories as well, so we go back to the dataset and filter the “water” category. It contains items such as “water with syrup”, which makes the average energy non-zero. Other categories in this top 10 list include vegetables such as butterbur, watercress, taro shoots and New Zealand spinach.

c%>% 
  group_by(category)%>%
  summarise(AverageEnergy=mean(Energ_Kcal))%>%
  arrange(AverageEnergy)%>%
  head(10)
## # A tibble: 10 x 2
##    category            AverageEnergy
##    <fct>                       <dbl>
##  1 SALT                         0   
##  2 SWEETENER                    0   
##  3 TEA                          1.50
##  4 WATER                        4.75
##  5 SWANSON BROTH                5.00
##  6 BUTTERBUR                    8.25
##  7 FLUID REPLCMNT              10.0 
##  8 WATERCRESS                  11.0 
##  9 TARO SHOOTS                 12.5 
## 10 NEW ZEALAND SPINACH         12.7
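
A category average can hide a single unusual item, as with the water category. One way to make that visible is to report the number of foods and the maximum next to the mean. The sketch below uses made-up categories "A" and "B" with made-up values, so it only illustrates the idea.

```r
library(dplyr)

# Hypothetical toy data: both categories have the same mean energy
toy <- data.frame(
  category = c("A", "A", "A", "A", "B", "B"),
  energy   = c(0, 0, 0, 20, 4, 6)
)

res <- toy %>%
  group_by(category) %>%
  summarise(AverageEnergy = mean(energy),  # same value for A and B
            MaxEnergy = max(energy),       # reveals the outlier in A
            n = n()) %>%                   # number of foods per category
  arrange(AverageEnergy)

print(res)
```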

Food Choices by Category

Then we realize that comparing all foods with each other is not very meaningful: it makes little sense to compare the energy in salt with the energy in water. Instead, we should pick groups of similar foods and compare within them, for example different kinds of meat or vegetables. We want to find which foods are healthier than the others within similar categories.

Protein Source

The first group we think of is meat, an important source of protein. We use dplyr to keep the foods in the chicken, lamb, pork, fish, duck and beef categories. The filter uses %in%, because comparing category to a vector with == recycles the vector along the rows and silently drops most matching foods. Then we draw a violin plot of the energy distribution for each kind of meat. Duck has only 12 foods in the dataset, probably because it is less popular than the other meats, so its violin rests on very few observations. Besides, some pork has very high energy, which makes it easy to gain weight, lamb has a narrow, concentrated range, and some fish have very low energy.

library(ggplot2)
# %in% keeps every row whose category is one of the six meats
meat<-c%>%filter(category %in% c("CHICKEN","LAMB","PORK","FISH","DUCK","BEEF"))
ggplot(meat, aes(category, Energ_Kcal))+geom_violin()+ylab("Energy(Kcal)")
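
A note on filtering by several categories at once: in R, `==` against a vector compares element by element and recycles the shorter vector, while `%in%` tests membership. The minimal sketch below shows the difference on a made-up `category` vector; the names `category` and `wanted` are hypothetical.

```r
# Hypothetical category column and the categories we want to keep
category <- c("BEEF", "DUCK", "PORK", "DUCK", "FISH", "BEEF")
wanted <- c("DUCK", "BEEF")

# == recycles `wanted`: row 1 is compared with "DUCK", row 2 with "BEEF",
# row 3 with "DUCK" again, and so on, so most true matches are missed
by_equal <- category == wanted

# %in% asks, for each row, whether its category is any of the wanted ones
by_in <- category %in% wanted

print(rbind(category, by_equal, by_in))
```

Here `==` keeps only the last row, while `%in%` keeps all four beef and duck rows.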

Then, we want to compare the amount of protein and fat in those 6 meat categories.

We use another package, “reshape2”, to reshape the data into long format so that we can draw a boxplot, which visually compares the amount of protein and fat across categories.

From this plot, we can see that fish and duck have relatively low fat and high protein, so they would be better choices of meat. Pork and lamb seem to have relatively low protein and high fat, so you may want to limit those meats if you are on a diet.

library(reshape2)
meat.m<-melt(data=meat,id.vars = 'category',measure.vars = c("Protein_(g)","Lipid_Tot_(g)"))
ggplot(meat.m)+geom_boxplot(aes(x=category,y=value,color=variable))+scale_y_continuous(limits=c(0,50))
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).

Bread

Since Americans love bread, we want to know which breads are healthier than others. We select all bread from the table and use a boxplot to get a general view of the protein, fat and carbohydrate in different kinds of bread. The plot shows that bread generally contains a high level of carbohydrate, which is not so good for keeping fit.

bread<-c%>%
  filter(category=="BREAD")
bread.m<-melt(data=bread,id.vars = 'category',measure.vars = c("Protein_(g)","Lipid_Tot_(g)","Carbohydrt_(g)"))
ggplot(bread.m)+geom_boxplot(aes(x=category,y=value,color=variable))

So we look for bread with relatively low carbohydrate. We use ggplot to draw a scatter plot of energy against carbohydrate. The breads in the lower left corner are the ones with low carbohydrate and low energy. After adding a geom_text layer, we can see that these include wheat bread, blue corn bread and kneel down bread. They may be relatively good choices if you are a bread lover who also wants to stay fit.

library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
## 
##     smiths
bread%>%
  ggplot(aes(x=`Carbohydrt_(g)`,y=Energ_Kcal))+geom_point()+geom_smooth(method="lm")+geom_text(aes(label=Abb))

Vegetables

Finally, we look at some common types of vegetables and see how many nutrients they have. The vegetable categories we select are cabbage, celery, broccoli, carrot, lettuce, kale, spinach and cauliflower.

Vegetables are good sources of fiber and vitamins. We use a boxplot to compare the amount of fiber and vitamin C in different kinds of vegetables.

Fiber is measured in g and vitamin C in mg, so a plot in the original units would be hard to read. We therefore express fiber in units of 100 mg (the value in g multiplied by 10).

Generally, broccoli and cauliflower have very high amounts of vitamin C, and so do some kinds of kale. The level of fiber is similar across categories, with cauliflower and carrot on the high side.

cabbage<-c%>%filter(category=="CABBAGE")
celery<-c%>%filter(category=="CELERY")
broccoli<-c%>%filter(category=="BROCCOLI")
carrot<-c%>%filter(category=="CARROTS")
lettuce<-c%>%filter(category=="LETTUCE")
kale<-c%>%filter(category=="KALE")
spinach<-c%>%filter(category=="SPINACH")
cauliflower<-c%>%filter(category=="CAULIFLOWER")

vegetable<-rbind(cabbage, celery, broccoli, carrot, lettuce, kale, spinach, cauliflower)
vegetable<-unite(vegetable,category,Abb,col="Abb2",remove = FALSE)
vegetable.m<-vegetable%>%mutate(Fiber100mg=`Fiber_TD_(g)`*10)%>%melt(id.vars = 'category',measure.vars = c("Fiber100mg","Vit_C_(mg)"))
ggplot(vegetable.m)+geom_boxplot(aes(x=category,y=value,color=variable))
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).

We plot the relationship between vitamin C and fiber. There is a positive relationship, but the slope isn't very steep. So in general, vegetables that have more vitamin C also have more fiber, but this isn't true in all cases. For example, the points in the upper left corner of the plot are vegetables with a high level of vitamin C but a low level of fiber, and those in the lower right corner are the other way around. We want to find vegetables that are high in both vitamin C and fiber, so we use geom_text to label the points and look at the upper right part of the plot: they are some types of kale, broccoli and cauliflower. This is consistent with the conclusion we drew from the boxplot.

vegetable%>%ggplot(aes(x=`Fiber_TD_(g)`, y=`Vit_C_(mg)`))+geom_point()+geom_smooth(method="lm")+geom_text(aes(label=Abb2))+xlab("Fiber(g)")+ylab("Vitamin C(mg)")
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
## Warning: Removed 3 rows containing missing values (geom_text).

Predict Energy

We wonder whether we can predict the energy of a food from a few important nutrition facts: fat, sugar, carbohydrate and protein. So we use “rpart” to fit a regression tree on those variables.

This is the resulting tree. Sugar does not seem to matter much in this model, since it doesn’t appear in any split. In general, if we know the amount of protein, fat and carbohydrate in a food, we can try to predict its energy using this regression tree.

library(rpart)
rpart_model <- rpart(Energ_Kcal~`Lipid_Tot_(g)`+`Sugar_Tot_(g)`+`Carbohydrt_(g)`+`Protein_(g)`, data=a, method="anova")
library(rpart.plot)
rpart.plot(rpart_model)
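
To show how such a tree is used for prediction, here is a minimal sketch on simulated data. The data frame `sim`, its column names, the formula generating `energy` and the example food are all made up for illustration; only `rpart`, `predict` and the `variable.importance` component come from the rpart package.

```r
library(rpart)

set.seed(1)
n <- 200

# Simulated stand-in for the nutrient table (grams per 100 g)
sim <- data.frame(
  fat     = runif(n, 0, 50),
  protein = runif(n, 0, 30),
  carb    = runif(n, 0, 60)
)
# Simulated energy: an assumed linear mix of the three nutrients plus noise
sim$energy <- 9 * sim$fat + 4 * sim$protein + 4 * sim$carb + rnorm(n, sd = 10)

# method = "anova" fits a regression tree, as in the model above
tree <- rpart(energy ~ fat + protein + carb, data = sim, method = "anova")

# Predicted energy for a hypothetical food; the value is the mean of its leaf
new_food <- data.frame(fat = 30, protein = 20, carb = 3)
pred <- predict(tree, newdata = new_food)
print(pred)

# Importance scores also credit variables that never appear as a primary split
print(tree$variable.importance)
```

A variable that is missing from the plotted splits can still have a non-zero importance score, so the score is a useful second look before calling a variable unimportant.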

K-Means Clustering

Finally, we run k-means clustering on Energy, Protein, Carbohydrate, Fiber, Sugar and Fat.

We first plot the total within-cluster sum of squares against the number of clusters to find a suitable number. From the graph, 3 or 4 looks like the best choice, since the curve flattens out there.

d<-na.omit(c)%>%select(Energ_Kcal, `Protein_(g)`, `Carbohydrt_(g)`, `Fiber_TD_(g)`, `Sugar_Tot_(g)`, `Lipid_Tot_(g)`)
library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_nbclust(d, kmeans, method="wss")
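
The curve drawn by `fviz_nbclust` with `method="wss"` can also be computed by hand with base R's `kmeans`. The sketch below does this on simulated data with two variables. It also standardizes the columns with `scale` first, which is an extra step assumed here and not part of the analysis in this report: energy is on a much larger numeric scale than the gram-based variables and would otherwise dominate the distances.

```r
set.seed(42)

# Simulated stand-in with two well-separated groups on very different scales
toy <- data.frame(
  energy  = c(rnorm(30, 100, 20), rnorm(30, 600, 50)),
  protein = c(rnorm(30, 5, 2),    rnorm(30, 20, 3))
)

# Standardize each column to mean 0 and standard deviation 1
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

print(round(wss, 1))
plot(1:6, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The elbow is the value of k after which adding another cluster no longer reduces the sum of squares by much.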

So we draw the cluster plots for both 3 and 4 clusters. In both, many observations are concentrated in the upper right part. We think 3 is probably the better choice: with 4 clusters the green group is very small, whereas with 3 clusters the groups are of similar size. In addition, with 4 clusters there is a large overlap between the blue and purple groups. Since the point of clustering is to separate observations into distinct groups, less overlap is better.

k3<-kmeans(d, centers=3, nstart=25)
k4<-kmeans(d, centers=4, nstart=25)
p3<-fviz_cluster(k3, geom="point", data = d)
p4<-fviz_cluster(k4, geom="point", data = d)
grid.arrange(p3, p4, nrow=1)